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Accurately inferring the state of a quantum device from the results of measurements is a crucial 
task in building quantum information processing hardware. The predominant state estimation pro- 
cedure, maximum likelihood estimation (MLE), generally reports an estimate with zero eigenvalues. 
These cannot be justified. Furthermore, the MLE estimate is incompatible with error bars, so con- 
clusions drawn from it are suspect. I propose an alternative procedure, Bayesian mean estimation 
(BME). BME never yields zero eigenvalues, its eigenvalues provide a bound on their own uncertain- 
ties, and it is the most accurate procedure possible. I show how to implement BME numerically, 
and how to obtain natural error bars that are compatible with the estimate. Finally, I briefly discuss 
the differences between Bayesian and frequentist estimation techniques. 



One of the prerequisites for quantum computing is 
"The ability to initialize the state of the qubits to a sim- 
ple fiducial state, such as |ooo...)" [l]. The device that 
prepares such a state must be tested and characterized, 
either to confirm that it reliably produces |ooo...), or to 
determine what state it does produce, so that it can be 
tuned to emit the desired one. This task, of experimen- 
tally finding a density matrix p to describe the output of 
a quantum device, is quantum state estimation. 

State estimation is more generally useful than it may 
appear. Two of the other quantum computing build- 
ing blocks listed in DiVincenzo's seminal paper [lj (low- 
noise universal gates, and minimal decoherence) refer to 
quantum processes. Quantum process estimation, used 
to characterize gates and decoherence, is mathematically 
equivalent to state estimation [2j. Thanks to quantum 
error correction and fault tolerant design, states (e.g., of 
the ancillae used for error correction) and gates for quan- 
tum computing need not be perfect, nor does the designer 
have to characterize them with infinite precision. They 
must function correctly with probability at least 1 — e, 
where e (the fault tolerance threshold) is thought to be 
somewhere between 10~ 5 [|| and 10 -2 [1]. A procedure 
for state estimation must accurately estimate probabili- 
ties on the order of e, and must provide a reliable bound 
on the uncertainty in the estimate. 

Maximum likelihood estimation (MLE), based on the 
principle that the best estimate is the state p that 
maximizes the probability of the observed data, is 
the current procedure of choice. Unfortunately, it has se- 
rious flaws, ft typically yields a rank-deficient estimate, 
with one or more zero eigenvalues. Such an estimate is 
implausible, implying that some measurement outcome 
is literally impossible. No finite amount of data can jus- 
tify such certainty. More importantly, it is impossible to 
bracket a zero probability with consistent error bars [25j . 
The MLE estimate is at best sub-optimal, and at worst 
dangerously unreliable (implying, for instance, that cer- 
tain errors can be ruled out). 

Bayesian mean estimation (BME) is an alternative 
procedure that avoids these pitfalls. Unlike MLE, which 
seeks a unique maximally plausible state, BME consid- 
ers the other states that are only slightly less plausible. 



The simple underlying principle is that the best es- 
timate is an average over all states p consistent 
with the data, weighted by their likelihood. The 

BME estimate is always full-rank, and comes equipped 
with a natural set of error bars. Moreover, each eigen- 
value A of /5 B me yields an upper bound (AA 2 < A) on its 
own uncertainty. Best of all, BME is provably the most 
accurate scheme possible [f| , under certain reasonable as- 
sumptions. 

The body of this paper is divided into three sections. 
Section |T] explains the problems with MLE. Section |TT] 
presents and analyzes the BME algorithm, along with 
one possible implementation. Section [Ml discusses some 
unsolved problems. 



I. THE STATE OF THE ART 

The oldest and simplest estimation procedure is quan- 
tum state tomoqraphy. In tomography, the estimator re- 
peatedly measures several observables, records the fre- 
quencies of the outcomes, and identifies the outcomes' 
frequencies with their probabilities. Inverting Born's 
Rule yields a unique density matrix p tama that predicts 
these probabilities. The most important problem with 
tomography is that p tomo often has negative eigenvalues, 
which means it cannot represent a physical state. 

In 1996, Hradil proposed maximum likelihood estima- 
tion (MLE) as a more flexible and sophisticated approach 
Q. An estimate p is a theory about the unknown state. 
Statisticians define the likelihood of a theory, C(p) as the 
probability that the theory (p) would have predicted for 
the observed data (A4), before the experiment took place: 



C(p)=p(M\p). 



(1) 



Thus, /5mle is simply the p that maximizes £(•) - i.e, the 
most "likely" state. C(p) is not a probability distribution 
over p, so /We is not in any well-defined sense the most 
probable state. Actually finding /5 MLE requires numerics, 
but several algorithms exist [7]. MLE was successfully 
applied in 2001 to a quantum optics experiment Q, and 
has been used extensively since then. 
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MLE has some critical flaws. The most visible is that 
/5 M le can be rank-deficient. If |v>) is the eigenstate cor- 
responding to a zero eigenvalue, then (ip\ p |v>) = 0. Such 
an estimate, though perhaps not unphysical, is implausi- 
ble - i.e., no experimentalist would believe it. It predicts 
exactly zero probability for every measurement outcome 
|j/>)(v>| such that (ip\p\tp) = 0. This implies absolute cer- 
tainty that \i>)(ip\ will not be observed, which cannot be 
justified by a finite amount of data. If N observations 
with d possible outcomes are available, then the lowest 
defensible probability estimate for any event is roughly 

M 

pmin ~ ivTd' (2) 

This is a practical concern. One of the seminal pa- 
pers on MLE Q estimated the polarization state of two 
entangled photons, produced by parametric downconver- 
sion. The estimated 4x4 density matrix has 2 eigenvalues 
that are exactly zero. More recently, MLE was used to 
estimate the entangled state of 8 ionic qubits in a trap 
Q. Of 256 eigenvalues, more than 200 are less than l/N 
(about 10 6 measurements were made), and at least 80 
are zero (to within machine precision). 

Zero eigenvalues are just the most extreme illustration 
of a more general problem; /5 ML e implies predictions that 
cannot be justified by the data. After 100 observations, 
p = 10 -8 is no more credible than p — 0. To put it 
another seriously might be exceedingly 

embarrassing in light of further data. Viewed this way, 
zero eigenvalues are just a symptom of the larger prob- 
lem. However, they motivate some useful questions that 
lend structure to this analysis: 

1. Why are zero eigenvalues a problem? 

2. Why does MLE produce zero eigenvalues? 

3. What is the underlying problem with MLE? 



A. Why are zero eigenvalues a problem? 

A quantum state is nothing more or less than a predic- 
tion of the future. Like a classical probability distribu- 
tion, it predicts probabilities for all measurements that 
could be performed. A state estimate is the estimator's 
best prediction of what future experimentalists will find 
when they observe a copy of the estimated system. We 
should therefore evaluate an estimate on how well it pre- 
dicts the future. 

Quantitative evaluation of an estimate is subject to de- 
bate. Statisticians disagree about how to interpret even 
a simple statement about a coin flip: "The probability 
of observing 'tails' is ptaiis = §•" However, it is indis- 
putable that "ptaiis = 0" implies that "tails" will never 
be observed, and that "ptaiis = 1" implies that nothing 
but "tails" will ever be observed. Such a statement has 



the force and status of a mathematical theorem, just like 
"There is no largest prime number." 

An estimator should hardly claim "ptaiis = 0" just be- 
cause she has never observed "tails". She has a finite 
number of observations to work from, and ptaiis might 
simply be very small. For example, if the data comprise 
a single flip, then at least one of the possible outcomes 
will never have been observed, but this does not justify 
asserting that it will never occur. Even if a dozen tri- 
als all yield heads, "ptaiis = 0" is unjustified. No matter 
how many data points the estimator has, we can always 
imagine a much larger dataset in the future, which might 
[embarrassingly] debunk the prediction "p = 0" . Thus, 
data can never justify reporting p = 0. Only prior knowl- 
edge, such as an impossibility theorem, can do so. 

One might object that an estimate carries with it an 
implied uncertainty. For instance, p = 0.5 is clearly 
a decent estimate of p = 0.51; why is p — not an 
equally good estimate of p = 0.01? The reason is that 
zero probabilities are not compatible with any 
error bars [27j • The estimate p = 0.5 could mean 
p = 0.5±0.01, meaning "p is probably between 0.49 and 
0.51." To report p = ± 0.01, however, is nonsensical. 
This would mean "p is probably between -0.01 and 0.01," 
but because p must be non-negative, an unconditionally 
better description is "p is probably between and 0.01," 
or p = 0.05 ± 0.05. 

This is not the only way of representing "p is proba- 
bly between and 0.01." If the estimator's confidence is 
skewed toward one side of the interval, then the best p 
might not be at its center. However, it should necessarily 
be within the interval, not on its boundary. An estimate 
on the boundary can't be optimal, because moving the 
estimate inside the boundary by some tiny e improves it 
(even if the optimal e is unknown). Since p = is on 
the boundary of any interval, p = is only optimal when 
the confidence interval has zero width. Taken seriously, 
a zero probability thus implies both: (1) absolute cer- 
tainty about the outcomes of future measurements; and 
(2) absolute certainty about the probability itself. 

This has practical consequences. If we accept that zero 
probabilities are implausible, then each zero eigenvalue in 
p should be replaced by a small, but finite, e. This poses 
two substantial problems. First, what is e? It clearly 
declines with N, but whether it should scale as l/N or 
l/\fN is unclear. Moreover, when statistics from many 
distinct observables are collated, it's not clear what N 
is. Second, how does "fixing" p's small eigenvalues affect 
its large eigenvalues? Since Trp = 1 is fixed, increasing 
many small eigenvalues will require decreasing the largest 
ones. These large eigenvalues are critical to most of the 
quantities of interest - entanglement, gate fidelity, etc. 
The only way to resolve this messy situation is to avoid 
zero eigenvalues in the first place. 
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B. Why does MLE produce zero eigenvalues? 

The zero eigenvalues in /5 ML e are connected to the 
negativity of tomographic estimates. What I will show 
in this section is that, for a given dataset, if p tomo 
is not positive, then p MLE is rank-deficient. On 
the other hand, if the tomographic estimate is positive, 
then /3 ML E = ft r „- MLE is thus a sort of "corrected 
tomography" [28j . 

The valid state-set, comprising all positive density ma- 
trices, is a convex subset of Hilbert-Schmidt space, the 
(d 2 — l)-dimcnsional vector space of Hermitian, trace- 
1 matrices. Its boundary comprises the rank-deficient 
states. Whenever /5 tomo lies outside this boundary, MLE 
squashes it down onto the boundary, producing a rank- 
deficient estimate. To demonstrate the connection, we 
begin by reviewing tomography. 




1. How tomography works 

Quantum state tomography is based on inverting 
Born's Rule: If a POVM measurement V = 
{Ei . . . Em} is performed on a system in state 
p, then the probability of observing Ei is pi — 

Tr(Eip). The probabilities for d 2 linearly independent 
outcomes single out a unique p tomo consistent with those 
probabilities. Several projective measurements (at least 
d + 1) can, in aggregate, form a quorum - i.e., provide 
sufficient information to identify /5 tomo . 

Note, however, that no measurement can reveal the 
probability of an event. Repeated measurements yield 
frequencies, from which the tomographic estimator infers 
probabilities. The measurement is repeated N times, and 
if outcome Ei appears rii times, we estimate pi = rii/N. 
If the measurements form a quorum, then the equations 

77 

Tr{p tom E i ) = ^ (3) 

can be solved to yield a unique /5 tomo . 

Tomography thus seeks a density matrix whose predic- 
tions agree exactly with the observed frequencies. Unfor- 
tunately, this matrix is not always a state. Suppose that 
an experimentalist, estimating the state of a single qubit, 
measures a x , a y , and o~ z - but only one time each! Hav- 
ing observed the +1 result in each case, she seeks a /5 tomo 
satisfying (a x ) = (a y ) = (a z ) = 1. Such a matrix exists, 

A- = ( i J ) • < 4 > 

but it has a negative eigenvalue A = ~ —0.207. 

Moreover, this "state" implies that all three spin mea- 
surements would be perfectly predictable, which is im- 
possible. 

Estimating the state from a single measurement of each 
basis is a rather extreme example. However, it illus- 
trates a point. Tomography, in a single-minded quest 



FIG. 1: A cross-section of the "Bloch cube", which contains 
all the possible tomographic estimates, and circumscribes the 
Bloch sphere containing all positive estimates. The points 
shown are possible tomographic estimates for N = 11 mea- 
surements each of a x and a z , with (a y ) set to zero for sim- 
plicity's sake. Of the 144 /5 t0 mo shown, 54 are non-positive 
(keep in mind that the a y dimension is ignored). Depending 
on the state, some p tomo will of course be more likely than 
others; this figure merely illustrates the array of possible non- 
positive estimates. 



to match Born's Rule to observed frequencies, pays no 
attention to positivity. As the number of measurements 
(N) increases, the possible tomographic estimates form 
an N x N x N grid. They fill a "Bloch cube," defined 
by {o~ x .y,z) £ [— which circumscribes the Bloch 
sphere and contains a lot of negative states (see Fig. [1]). 
If the true state is sufficiently pure, then the probability 
of obtaining a negative estimate can remain as high as 
50% for arbitrarily large N, since the true state is very 
close to the boundary between physical and unphysical 
states. 

In larger Hilbert spaces, the problem gets worse for 
two reasons. First, the state-set's dimensionality (and 
therefore the number of independent parameters in p) 
grows as d 2 — 1. In order to keep the RMS error (A2 = 
yTr [(p tomo — p) 2 ]) fixed, N must grow proportional to d. 
Second, p t0 mo has more eigenvalues, so the probability of 
at least one negative eigenvalue grows with d (for fixed 
A 2 ). Together, these scalings ensure that tomographic 
estimates of large systems are rarely non-negative. 

The problems with tomography are well known - neg- 
ative eigenvalues were precisely the embarrassing feature 
that motivated MLE. As we shall see, however, MLE's 
implausible zero eigenvalues are closely related to tomog- 
raphy's negative ones. 
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2. How MLE works 

MLE, though sometimes complex in implementation, is 
very simple in theory. Given a measurement record M. = 
{Mi, M 2 , M 3 . . . M N } (where is a positive operator 
representing the ith observation) , the estimator seeks the 
maximum of the likelihood function, 

C(p)=p(M\p) = ]J(Tr[M i p]). (5) 

i 

M. can be compactly represented as a list of frequen- 
cies. Define a set V = {Ei . . . E m } containing all possible 
outcomes, and let m be the number of times that Ei ap- 
pears in M.. Then M. ~ . . . n m }. As N increases, 
the frequency representation of M. remains short. 

Finding /5 MLE is feasible because £(p) has two conve- 
nient properties. First, it is non-negative, so we can 
maximize log(£(p)). Second, log(£(p)) is convex. The 
proof is quite simple: we observe that log(£(p)) = 
Y^i logTr[Mj/3]; that Tr[M;p] is a non-negative, linear 
function of p; that the logarithm of a linear function is 
convex; and that the sum of convex functions is convex. 
Among other things, this means that C(p) has a unique 
local maximum. 



3. The relationship between tomography and MLE 

The likelihood function has another elegant property: 
If there is a state /5 tomo , such that the probability 
predicted for every outcome is equal to its ob- 
served frequency, then p tomo is the maximum of 

C{p). To prove this, let us write log£(p) in terms of (a) 
the observed frequencies (fj = and (b) the predicted 
probabilities (pj = Tr[Ejp\) for all the Ef 





= H(Tr[Mip]) = nwr 

i 3 


(6) 


log(£(p)) 


= ^nj-logC&f^]) 

j 


(7) 
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(8) 




= Nj2lf3 log fj-Ui log/i- 


fj log Pj)] 
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= -N[H(f) + D(f\\p)}. 


(9) 



The last line invokes two information-theoretic quanti- 
ties, entropy H(-) and relative entropy D(-\\-). H(f) 
doesn't depend on p, so it is irrelevant for maximization. 
The relevant quantity is D(f\\p), which is always non- 
negative, and uniquely zero when p = f. Thus, log(£(p)) 
is uniquely maximized when p = f. □ 
So, if Ptomo is a valid state, then /5 MLE = p tomo . What 
if Ptomo exists, but is not a valid state? It must still 
be Hermitian and have unit trace. Furthermore, it pre- 
dicts non-negative probability for each Mi observed, so 




FIG. 2: An example of a likelihood function (for a sin- 
gle qubit) whose unconstrained maximum lies outside the 
state-set, and whose constrained maximum therefore lies on 
its boundary. The domain shown here is a cross section 
of the Bloch sphere, with (a y ) — 0. This particular like- 
lihood function is obtained from 16 measurements each of 
a x and a z , comprising 14 ji) and j+) results, and 2 |o) 
and |-} results. The unconstrained maximum of C(p) is 
at p tomo = | (H + §03 + |°"z)) which has a negative eigen- 
value. The constrained maximum is at /3 M le = where 
|*> = -7-1— (|i> + |+». 



Tr[Eip tomo ] > for all i. The hyperplanes Tr [Erf] = 
define a polytope in Hilbert-Schmidt space - a simple ex- 
ample is the "Bloch cube" referred to previously - which 
contains /5 tomo . 

If we extend the domain of C{p) to this polytope and 
its interior, then its maximum must coincide with p tomo , 
since p tomo predicts the correct frequencies. Tomography, 
in other words, is essentially unconstrained MLE. 

Because C(p) has a unique local maximum at p tomo , its 
maximum over a closed region which does not contain 
Ptomo must lie on the boundary of that region (see Fig. 
[2]) . The set of non-negative density matrices is precisely 
such a closed region, so whenever p tolao is not a valid state, 
Pmle must lie on the boundary of the state-set. That is, 
it will be rank-deficient. 

MLE and tomography are thus variants of the same 
procedure, distinguished only by the positivity constraint 
[29| . MLE is a sort of minimal fix for tomography, return- 
ing the non-negative state that is in some sense "closest" 
to ptomo- Actually computing the number of zero eigen- 
values in Pmle seems difficult, but nu merical exploration 
for 1, 2, 3, and 4 qubit problems [lj| suggests that p MLE 
usually has at least as many zero eigenvalues as p tomo has 
negative ones. In conjunction with the observation that 
large-system tomography tends to yield many negative 
eigenvalues, this explains the many zero eigenvalues in 
experimental applications of MLE. 
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C. What is the underlying flaw? 

Tomography and MLE maximize £(/5), over different 
domains. They display the same pathology, implying un- 
justifiable (zero or negative) probabilities. The underly- 
ing problem is simple: maximum likelihood meth- 
ods are frequentistic; they interpret observed fre- 
quencies as probabilities. By maximizing C(p), they 
seek to fit the observed frequencies as precisely as pos- 
sible. If there exists a p that fits the data exactly, then 
that is always the best estimate. 

The point of state estimation, however, is not solely 
to explain the data. Rather, it is to find a state that 
will predict the future. It should concisely describe what 
the estimator knows about the system being estimated. 
Mindless data fitting accomplishes only retrodiction, of 
the past. The best description of the past (i.e., data) 
probably does not describe the estimator's knowledge, 
especially her uncertainty. 

For example, consider estimating the bias of a coin af- 
ter flipping it just once. The best fit to the data is to 
assume that the coin always conies up the same way. 
This clearly does not describe the estimator's knowledge 
- an honest description would perhaps include the word 
"scant". Ironically, it is the high entropy of the esti- 
mator's knowledge that causes a spuriously low-entropy 
estimate. 

The problem with MLE is that it matches probabili- 
ties to observed frequencies, consistent with frequentist 
statistics. This is actually unfair to frequentism, which 
begins by defining probability as the infinite-sample-size 
limit of frequency. A true frequentist avoids making 
statements about probabilities in the absence of an in- 
finite ensemble, so applying a frequentist method to rela- 
tively small amounts of data is inherently disaster-prone. 
Nonetheless, this is precisely what is happening in MLE. 
For further discussion, see Section Hi Dl 



3. "Optimize accuracy." Obviously, the estimate p 
should be close to the true p. How do we eval- 
uate this? Quantum strictly proper scoring rules 
[H yield a class of metrics designed to measure 
this closeness, called operational divergences. BME 
uniquely minimizes the expected value of every op- 
erational divergence. 

Each of these motivations illustrates one of BME's major 
advantages. The estimate predicts reliable probabilities 
for all measurement outcomes, it comes with a free set 
of error bars, and it is (on average) the most accurate 
estimate that can be made from the data. 

Bayesian approaches have been previously discussed 
in various contexts. Helstrom [1JJ applied Bayesian 
methodology extensively to estimation. He considered 
a variety of utility functions, especially the rather patho- 
logical (5-function utility that motivates MLE, without 
pay ing particular attention to the posterior mean. Jones 
[12j | applied Bayesian inference with Haar measure, fo- 
cusing on information-theoretic bounds. Derka et al fl3j 
examined Bayesian estimation in some detail, primarily 
in its connections to tomography and maximum-entropy. 
Schack et al [l4| formalized a deep connection to ex- 
changeable (deFinetti) states. More recently, Neri [l5[ 
considered Bayesian estimation of phase difference in co- 
herent light. Tanaka and Komaki [l6[ proved the op- 
timality of Bayesian estimation with respect to relative 
entropy. 

My goal in this section is to propose BME as a prac- 
tical procedure for state estimation, and to describe its 
operational advantages. I begin by concisely presenting 
the BME algorithm, then discuss in Section III Bl how it 
can be implemented. Section III CI analyzes the proper- 
ties of Pbme, focusing on the three advantages asserted 
above. Finally, Section III Dl contrasts the Bayesian and 
frequentist approaches. 



II. BAYESIAN MEAN ESTIMATION 

Bayesian methods provide a different perspective on 
statistics. The procedure presented here, Bayesian mean 
estimation (BME), avoids the pitfalls of MLE. Here are 
three basic tenets, each of which independently motivates 
BME: 

1. "Consider all the possibilities." MLE identifies 
the best fit to observed data, but many nearby 
states are almost equally likely. An honest estimate 
should incorporate these alternatives. 

2. "Demand error bars." The estimate should be com- 
patible with error bars, e.g. p ± Ap. This implies 
a ball containing most of the plausible states, of 
size Ap, with p somewhere around the center. If 
p is rank-deficient, no such region exists. Thus, p 
should lie far enough from the state-set's boundary 
to be compatible with well-motivated error bars. 



A. The BME Algorithm 

Bayesian mean estimation is conceptually simple. 

1. Use the data to generate a likelihood function, 
C(p) = p(A4\p). C is not a probability distribu- 
tion; it quantifies the relative plausibility of differ- 
ent state assignments. 

2. Choose a prior distribution over states, iro(p)dp. It 
represents the estimator's ignorance, and should 
generally be chosen to be as "uniform" , or unin- 
formative, as possible. 

3. Multiply the prior by the likelihood, and normalize 
to obtain a posterior distribution 

TT f (p)dp oc C(p)TT (p)dp, (10) 

which represents the estimator's knowledge. The 
proportionality constant is set by normalization. 
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4. Report the mean of this posterior, 

Pbme = J pn f (p)dd. (11) 

This is the best concise description of the estima- 
tor's knowledge. 



B. Implementation 

In practice, BME comes down to computing an inte- 
gral. The best way of doing this remains uncertain, as 
does the existence of an exact solution. The numerical al- 
gorithm presented below has been demonstrated to work 
well in a small variety of cases. However, it could be im- 
proved in many ways, and has some glaring deficiences. 
This algorithm should thus be taken as a proof of prin- 
ciple (i.e., it's possible to do Bayesian estimation) rather 
than an optimal approach. 

An important observation for any integration proce- 
dure is that the likelihood is easy to compute. C(p) 
is the probability of observing a sequence of outcomes 
M. = {M\ . . . Mn}, given p. This is the product of the 
probabilities for the individual Mj, each of which is given 
by Born's rule: 

C(p) = r Tr(M 1 p)Tr(M 2 p)Tr(M 3 p) . . . Tv(M N p) (12) 

When M. is represented using frequencies {Ei was ob- 
served rii times, for i £ [1 . . . m]), this can be evaluated 
in 0(m) time: 



The precise form of the rule is unimportant; it is usu- 
ally stochastic, although a deterministic rule (traversing 
a quasi-random set) is conceivable. What is important 
is that J should generate the underlying measure dp: for 
any p , the set {J n (p ) : n 6 [0...JV]} should sample 
uniformly from dp as N — > oo. For example, the follow- 
ing rule implements Lebesgue measure over the interval 
[0 ... 1]: J(x) — (x + y) mod 1, where y is selected from a 
Gaussian distribution with zero mean and fixed variance. 
Such a rule, unmodified, would compute J f(p)Ap. To 

average instead over £(p)7To(p)dp, we modify the rule as 
follows. After choosing p' , but before jumping to it, we 
compute the likelihood ratio 



c(p')Mp') 
£(pKo(p) ' 



(14) 



C(p) = Tr(E lP )?Tr(E 2 p)Z . . .Tr(E m p)^ 



(13) 



If r > 1 (p' is more likely than p) then we jump as before. 
If not, then we jump to p' with probability r, and stay 
at p (adding it, once again, to the running total) with 
probability 1 — r. 

This biasing ensures that the algorithm spends more 
time at more likely spots, and tends to lurch uphill into 
regions of high probability. Unlike a gradient algorithm 
(as might be used for MLE) , it does not actively seek the 
point of highest probability; jumping to a region of lower 
probability is both possible and necessary. Detailed dis- 
cussion and explanation of why this works can be found 
in IH. 



2. Applying Metropolis-Hastings to quantum states 



1. The Metropolis-Hastings algorithm 

In the absence of an analytic solution to the integral, 
we fall back to numerical Monte Carlo methods. Because 
C{p) is usually a sharply peaked function over a high- 
dimensional space, brute-force random sampling will con- 
verge extremely slowly. Metropolis algorithms [l7[ were 
conceived for precisely such situations. A variant known 
as Metropolis-Hastings [l8|, [l9[ is commonly used for 
Bayesian estimation, and can be adapted straightfor- 
wardly to quantum states. 

The Metropolis-Hastings algorithm computes the av- 
erage value of a function (in this case, p) over an inte- 
gration measure (in this case, £(p)7To(p)dp). It leverages 
the fact that £(p)7To(p)dp is typically dominated by a 
small region of high likelihood. Whereas basic Monte 
Carlo methods jump randomly around the integration 
measure, Metropolis-Hastings makes local, biased jumps. 
This samples the most relevant parts of the sample space 
preferentially. After each jump, the current value of p is 
added to a running tally. This tally, divided by the total 
number of jumps, becomes the final average. 

To implement Metropolis-Hastings, we begin with a 
rule J for jumping from any p to a nearby p' = J(p). 



The heart of the algorithm is the rule J. It determines 
dp, and its form is critical to the algorithm's performance. 
Different underlying measures will require different rules. 
Measures with some claim to "uniformity" are usually 
invariant under a symmmetry group. The natural group 
for quantum states on a d-dimensional Hilbert space is 
SU(d), and the measure that this group induces over pure 
states is called Haar measure. 

A sensible prior should extend over all possible states, 
so we need measures extending to mixed states. How- 
ever, there is no uniquely suitable measure over mixed 
states, because their spectral degrees of freedom (eigen- 
values) have no obvious symmetry. One appealing class 
of measures, proposed by Zyzkowski and Sommers [20| . 
is the set of induced measures, denoted here by dy?. They 
are obtained by beginning with Haar measure on a d x k 
dimensional system, then tracing out the ancillary fac- 
tor. Thus, dip is simply the Haar measure on pure states; 
while d ( p is the Hilbert-Schmidt measure (Lebesgue mea- 
sure on the vector space of Hermitian d x d matrices). 

These induced measures are easy to implement. In- 
stead of keeping track of p itself, we generate and track 
a pure state \ipdxh) in d x k dimensions. At each step, 
p is obtained by tracing out part of \ipdxk)(i>dxk\- The 
ancillary degree of freedom acts as a sort of hidden vari- 
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able, internal to the algorithm. We need only a rule J to 
implement Haar measure over the larger Hilbert space. 

This could be done in many ways - for instance, at 
each step, we could generate a random unitary from Haar 
measure. This has two huge drawbacks. First, the jumps 
are nonlocal, which negates the key advantage of the 
Metropolis-Hastings algorithm. Second, generating and 
applying a random unitary is computationally expensive. 
Instead, we need a relatively small set of efficiently con- 
structable unitaries that generate the entire group. 

Here is an efficient local random walk rule that gener- 
ates Haar measure on a d-dimensional Hilbert space: 

1. Choose a direction, by generating two random inte- 
gers [0 . . . d — 1]. Select a Hermitian operator 
Hij that acts only on the , \j)} subspace. Define 
Hij — a z if i = j, H = a x if i < j, and H — a y if 
i > j. 

2. Choose a distance, 5, from a distribution (e.g., 
Gaussian) with (5) = and (S 2 ) = A 2 . We will 
discuss the choice of A below. 

3. Let J{\ip}) = e lHij 8 Since U acts nontrivially 
only on the |i) , \j) subspace, this can be done very 
easily and quickly. 

Each step's distance is chosen randomly to ensure uni- 
form sampling - with a fixed stepsize, it is just barely 
conceivable that this algorithm might trace out a discrete 
lattice of states. The average step size A is important: if 
A is too large, the algorithm will not find small regions 
of high probability efficiently; if A is too small, it will ex- 
plore the space very slowly. The optimal A will depend 
on L(p), and there is no way to identify it a priori. 

The algorithm must therefore vary A dynamically, 
with feedback. If A is very, very small, then almost every 
jump will be accepted, whereas if A is large, very few will 
be. A good heuristic is that the acceptance ratio should 
be around 60% [2l| (other values are also suggested 
The algorithm should track the acceptance ratio over the 
last ~ 1000 jumps, gradually changing A as appropriate 
to maintain it around 60%. 

Dynamically adjusting the step size like this can, in 
theory, break the convergence properties of the algo- 
rithm. This occurs if the distribution over states is mul- 
timodal; the step size is reduced in order to explore one 
narrow peak in detail, and a far-off peak becomes in- 
accessible. Fortunately, the likelihood function itself is 
guaranteed to be log-convex, and therefore uni-modal. 
For well-behaved priors with convex support (e.g., the 
Hilbert-Schmidt prior, djp), this means that £(p)iro(p) 
can safely be sampled this way. 

Other priors - in particular, the Haar prior, which is 
interesting as a limiting case - do not have convex sup- 
port. These priors will yield multimodal posterior dis- 
tributions. How to effectively and reliably sample from 
such distributions is an outstanding problem. Repeat- 
ing the sampling many times, with randomly distributed 
starting points, is not reliable. It fails badly if two similar 



peaks in the distribution have unequally sized regions of 
convergence; the peak with the larger convergence region 
will be relatively oversampled. 

C. [Good] Properties of the BME estimate 

Why should an experimentalist use Bayesian mean es- 
timation? After all, BME (via Monte Carlo) is more com- 
putationally intensive than MLE. The answer, of course, 
is that BME provides a better estimate than MLE. Specif- 
ically: (1) Pbme's eigenvalues are never unjustifiably small 
(or zero); (2) the procedure can easily be made to yield 
well-motivated error bars that are compatible with p BME ; 
(3) BME is, in a particular sense, the most accurate pos- 
sible estimate - not just asymptotically, but for finite N. 

1. The estimate is plausible 

The first objection to MLE is that /5 MLB is implausible; 
it can (and often does) have zero eigenvalues, which im- 
ply an unjustified certainty. Any alternative procedure 
should yield a strictly positive estimate. BME yields just 
such an estimate, subject to a very weak restriction on 
the prior. 

Consider a simple and illustrative example in classi- 
cal estimation. We estimate the bias 6 of a coin, which 
comes up "heads" with probability b, and "tails" with 
probability 1 — 6. 

Flipping the coin N times yields a measurement record 
consisting of n heads and N — n tails. The likelihood 
function is 

C(b) = b n (l - b) N ~ n , (15) 

and so the MLE estimate is 

t n 

Omle = jf- (16) 

If n — or n — TV, then 6mle will assign zero probability 
to observing either "heads" or "tails" , respectively. 

If we adopt a Bayesian approach, then we must choose 
a prior - e.g., the uniform prior with respect to Lebesgue 
measure, no(b)db = db. The mean of the posterior is an 
integral of the likelihood, and we get: 

n + 1 , 
6bme = j^. (17) 

Since < n < N, the Bayesian estimator never assigns 
zero probability to anything. The lowest possible proba- 
bility assignment for either heads or tails is p m i n = jvT2 ■ 
With no data at all, the Bayesian assigns p = ^ to both 
outcomes; after a single flip, she assigns p = | to the 
outcome that was observed, and p = ^ to the other. 

This is the property that we want in a quantum estima- 
tion procedure. The probabilities assigned to unobserved 
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events are not only nonzero, but also sensible - after N 
trials, it's reasonable to assume that the probability of an 
as- yet-unobserved outcome is at most 1/N, and to assign 

Pmin ~ jy' 

However, this property depends on the prior. Consider 
the prior iro(b) — ^ (S(b) + S(l — &)). After one observa- 
tion of "tails," our Bayesian estimate would be 6bme = 0, 
which is implausible. The problem is that a finite num- 
ber of observations (one) ruled out every b in the support 
of 7To that ascribed nonzero probability to "heads." 

The situation gets even worse if the next observation 
is "heads." The data now rule out every hypothesis, the 
posterior ttj (p)dp vanishes entirely, and the Bayesian pro- 
cedure simply fails. This stems from a contradiction. A 
prior over states implies a probability distribution over 
observations as well, ttq assigned exactly zero probabil- 
ity to Ai = { "heads", "tails"} - which was then observed, 
causing a contradiction. 

The following statements about a prior ttq are logically 
equivalent: 

(a) 7To assigns zero probability to some [finite-length] 
measurement record. 

(b) Bayesian estimation using itq will, for some measure- 
ment record, yield an estimate with a zero probabil- 
ity. 

(c) There exists a measurement record that will annihi- 
late ttq, so that Bayesian estimation fails completely. 

Let us define a fragile prior as one for which these 
statements hold (and which can therefore yield a rank- 
deficient estimate). An estimator should choose a ro- 
bust (i.e., not fragile) prior, which in turn guarantees a 
full-rank estimate. 

In classical probability estimation, avoiding fragility is 
simple: A prior is robust if and only if it has support 
on the interior of the probability simplex. States in the 
interior do not predict zero probability for any observa- 
tion. They can never be ruled out, so a prior supported 
on one can never be annihilated by the data. Conversely, 
every prior supported only on the boundary will be an- 
nihilated by a measurement record that includes every 
possible outcome. 

Intriguingly, this condition does not extend to the 
quantum problem. Support on the interior (i.e., on the 
full-rank states) is sufficient, but not necessary, for ro- 
bustness. Consider estimation of a single qubit using the 
Haar prior, which is restricted to (and uniform over) the 
pure states. Each observation rules out, at most, a single 
pure state - if |o)(o| is observed, then the true state cannot 
be There are uncountably many distinct candidate 

pure states, which means that no [finite-length] measure- 
ment record can annihilate the prior. The Haar prior is 
robust. 

As a general rule, just about every prior that a halfway- 
sane estimator would pick is, in fact, robust. Not only 
the Haar prior (which implies absolute certainty that p is 



pure), but much more extreme priors, such as an equato- 
rial distribution on the Bloch sphere - or, for that mat- 
ter, any continuous curve on the Bloch sphere's surface 
- are robust. Appendix lAl demonstrates a necessary and 
sufficient condition. 



2. The estimate comes with natural error bars 

Another objection to the MLE procedure is that p MLE 
is not, in general, compatible with any error bars. This 
is an obvious consequence of zero eigenvalues; error bars 
imply a region of plausibility surrounding the point esti- 
mate. When the estimate lies on the state-set's bound- 
ary, no such region can exist - in order that /5 M le be in 
its interior, the region would have to contain negative 
matrices. 

The BME estimate is always full-rank, which is en- 
couraging. This in itself does not guarantee compatibil- 
ity with sensible error bars. The estimate p = ^/5 MLE + 
11 is full-rank, but the estimator's honest uncertainty 
about p might well be greater than ±1%. Happily, the 
BME estimation procedure can easily be adapted to yield 
natural error bars, which are compatible with the point 
estimate. 

First, let us consider what form these error bars should 
take. Intuitively, the qualified estimate should look like 



p = p±Ap, 



(18) 



but what, precisely, is "Ap"? As p is a d x d matrix, 
we might suppose that Ap is also a d x d matrix, so 



Pij 



Pij ± Apij. This fails to account for covariance 



between distinct elements of p. For example, the diagonal 
elements of p must vary together to maintain Tr(p) = 1. 

The correct way to think about the estimator's uncer- 
tainty begins by representing the estimate, /5 B me, as a 
d 2 — 1 dimensional vector in Hilbert-Schmidt space. For 
a single qubit: 



TT(a x p) 
Tr(a y p) 
Tr(<7 2 p) 



(19) 



The estimator's uncertainty is represented as a symmet- 
ric covariance matrix on the same space: 



(Ax 2 Axy Axz \ 
Axy Ay 2 Ayz 
Axz Ayz Az 2 J 



(20) 



The elements of Ap involve two different expectation 
values: one with respect to the state, denoted (X) = 
Tr(Xp); and one with respect to the posterior probabil- 
ity, denoted / = J f(p)ir(p)dp. Using this notation, 



Ax 2 = 



(21) 



with the other elements given by the obvious generaliza- 
tion. 
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Represented as a covariance matrix, Ap quantifies the 
second cumulants of the estimator's probability distribu- 
tion Wf(p)dp. It defines an ellipsoid in Hilbert-Schmidt 
space, which is a credible interval (the Bayesian version 
of a confidence interval). The eigenvectors of Ap are op- 
erators that define the principal axes of this ellipsoid, and 
the corresponding eigenvalues are their lengths. 

As a matrix that acts on density matrices, Ap is a su- 
peroperator. It is symmetric and non-negative, but not 
completely positive or trace-preserving, so it cannot be 
interpreted as a quantum process. However, the superop- 
erator interpretation gives a formula for the estimator's 
uncertainty about the expectation value of a particular 
operator X. Defining Ap[A] to be the superoperator's 
action on X , 



A (X) 2 = Tr (A* Ap[X}) 



(22) 



quantifies the estimator's expected error in (X). 

Alternatively, Ap can be represented as an unnormal- 
ized symmetric bipartite state, 



A P 



p <x> p — pS3 p 



(23) 



p® p irf(p)dp (24) 



and in this representation, the estimator's expected error 
in (X) is 



A (X) 2 — Tt(X(® XAp) . 



(25) 



This Ap is a consistent representation of the estima- 
tor's uncertainty; for any X, it yields the same A (X) 2 
that an independent estimate of (X) would. To see this, 
let X be an arbitrary observable with eigenvalues be- 
tween x m ; n and x max . The variance computed via BME 
is: 



A (X) 2 = Tr(A (g) XAp) 



Tr 



X <g> X J p® p 7T/(p)dp 

-Tr [X (g> Xp BME (g) p BME ] 
J Tr[Ap] • 1r[Xp]-Kf{p)dp - Tr[Xp : 



BME J 

i 2 



(X) p n f (p)dp 



(26) 



Because (X) parameterizes exactly one of the dimensions 
of Hilbert-Schmidt space, we can compute a marginal 
distribution over (X) by integrating ivf(p)dp over its 
other d 2 — 2 dimensions, which we denote by a. Then 
dp = d(A")dcr, and 



n f ((X))d(X) ee / n f (p)dp, 
in terms of which, 



(27) 



A(xy 



(X) 2 n f ((X))d(X)- 



(X)n f ((X))d(X) 
(28) 



which is the familiar formula for the variance of the uni- 
variate distribution 7Ty ((A)). 

In particular, if |v>) is an eigenvector of p BM E, let X = 
\-4>)(ip\- Then A = (X) is the corresponding eigenvalue, 
and AA 2 = A (X) 2 is the reported uncertainty about it. 
Since (X) is between and 1 for all p, 7r/(A)dA is a dis- 
tribution over the interval [0 ... 1]. For any such distribu- 
tion, AA 2 < A(l — A), so every eigenvalue yields an upper 
bound for its own uncertainty. Note, too, that this bound 
is uniquely saturated by 7r(A) = (1 — p)S(X) + pS(l — A), 
which is maximally bimodal. In practice, well-behaved 
priors will produce convex posteriors, for which AA 2 < A 2 
(i.e., AA is no greater than A itself) can reasonably be 
expected. 



3. BME optimizes accuracy 

Above all else, an estimation procedure should yield 
an accurate estimate - one as close to the "true" state 
as possible. While the concept of a "true" state is prob- 
lematic in actual experiments, it makes perfect sense in 
the context of a simple game. An impartial judge selects 
a state p, and provides N copies of it to the estimator, 
who measures them and reports an estimate p. The best 
procedure is the one that consistently makes p as close 
as possible to the unknown [to the estimator] p. 

The point of this section is to show that BME is the 
most accurate scheme possible, in the sense that the ex- 
pected error between p and p is minimal. The argument 
presented here is brief; for more detail see [j| . This opti- 
mally holds for every finite A, not just asymptotically. 
It depends, of course, on the measure of "error" between 
p and p adopted. The error measures optimized by BME, 
operational divergences, are arguably the best-motivated 
such measures. 

Operational divergences, denoted A(p : p), measure 
how well the density matrix p describes (or estimates) 
the quantum state p. A certain subtlety should be noted 
here: whereas p represents the state of a quantum sys- 
tem, p is a classical description of a state - e.g., a density 
matrix written down on paper. Two natural require- 
ments constrain operational divergences. First, A must 
represent the outcome of some physically implcmentable 
process. Second, the best description of p had better be 
p itself. 

To satisfy operationality, we imagine trying to moti- 
vate the estimator to do a good job. A third-party ver- 
ifier, equipped with the estimate p, will perform a mea- 
surement on p. This measurement, V(p) = {Ei . . . E m }, 
is an arbitrary POVM that may depend on p. Depend- 
ing on the outcome (i), the verifier pays the estimator an 
amount ri(p). 

The estimator's reward is represented by an operator 



K(p) = J2r t (p)E t (p), 



(29) 
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and her expected reward (which she hopes to maximize) 
is 



r(p : p) = Tr(pK(p)), 



(30) 



The amount that she loses by inaccurately describing the 
state, 

A(p : p) = r(p : p) - r(p : p) = Tr [p (K(p) - K(p))] . 

(31) 

is an operational divergence. Note that: (1) it is opera- 
tionally significant; and (2) the best description of p is p 
itself. 

Of course, not every reward scheme is strictly proper, 
satisfying the condition that p be its own best estimate, 



r(p : p) > r(p : p) V p ^ p. 



(32) 



Equation [521 is a constraint on IZ(p). If we define G(p) = 
r(p : p) as the expected reward for a perfect estimate, 
then a bit of algebra yields 



G(p)>G(p)+Tr[(p-p)TZ(p)} 



(33) 



Eq. [33lholds if and only if: (1) r(p : p) (as a function of p) 
is tangent to G(p); and (2) G(-) is strictly concave. Thus, 
for for every strictly concave function G(-) on density 
operators, there is a unique operational divergence [30] |: 

A(p : p) = G(p) - G(p) - Tr \{p - p)VG(p)} , (34) 

where VG(-) is the gradient of G(-). 

Operational divergences include widely used measures 
such as the squared Hilbert-Schmidt or L2 distance, 



A 2 (p:p) =Tr[(p-p) 2 ] , 



(35) 



associated with G(p) = Tr(p 2 ); and the relative entropy 
or Kullback-Leibler divergence, 



Akl (p:p) = Tr [p (log p - log p)] , 



(36) 



associated with G(p) = —H(p) = Tr(p\ogp). 

Now that we have determined how to measure accu- 
racy, let's try to optimize it. This is an easy task for 
an omniscient estimator, because the best estimate of p 
is p itself. If the estimator actually knows p, then her 
best plan is to report p = p. The interesting case is an 
uncertain estimator. She must consider all the possible 
p, in order to choose the best p. A risk-neutral estimator 
seeks to maximize her expected reward, averaged over all 
possible p. 

Consider any estimation procedure, as a map from 
measurement records M. to estimates p(A4). Which pro- 
cedure should the estimator choose? Suppose that the 
unknown state p to be estimated will be drawn from 
an ensemble described by 7To(/o)dp. The expected reward 
yielded by the procedure p{M) is an average over (a) 
possible p, and (b) the ensuing A4. 

r= [ n (p)dpJ2p(M\p)r(p:p(M)) (37) 



Inserting r(p : p) = Tr \pR(p)] (Eq. [30]), 

Tr (p)dpJ2p(M\p)Tr{pK(p(M))}. (38) 



M 



The trace, sum, and integral are all linear, so we can 
rearrange them as 



M 



pp(M\p)n (p)dp\n(p(M)) 



(39) 



We now observe that J p(M\p)iro(p)dp = p(M), the un- 
conditional probability of observing M.. Furthermore, 
/ pp(M\p)iro(p)dp = Pbme(-A^), the BME estimate given 
ttq. Using these identities, the estimator's expected re- 
ward is 

r = ^p(M)Tr[p BME (M)^( j 5(X))] (40) 

M 

= ^p(M)r(p BME (X) : p(M)), (41) 

M 

and each term in the sum can be independently maxi- 
mized. For each M, the optimal p(M) is Pbme — which 
means that BME is unconditionally the optimal estima- 
tion procedure. 

This result is remarkable because it makes no appeal to 
asymptotics; the optimality holds for 100, 10, or even just 
1 observation. Of course, when the estimator has insuffi- 
cient data, the resulting estimate will not be very accu- 
rate - but neither will any other estimate. Crucially, her 
uncertainty will be reflected in a highly mixed estimate, 
with large error bars. Unlike MLE, BME fails gracefully, 
making the best use of the available data without over- 
reaching. 

Two points should, however, be kept in mind. First, 
BME is not necessarily optimal according to standards 
that are not operational divergences - e.g., trace-distance 
or fidelity. Measuring the performance of an estimation 
algorithm by these standards is generally unwise, but 
they are commonly misused in this way (especially fi- 
delity). Second, the optimality proof assumes that the 
estimator's prior coincides with the ensemble from which 
the unknown states were selected. A sufficiently wrong 
prior will lead to horrendous results. Further research 
into uninformative priors, and techniques for selecting 
priors, may alleviate this problem. 



D. Bayesian and Frequentist approaches 

Having examined both frequentist and Bayesian ap- 
proaches, I have focused on the concrete details - how 
does MLE fail? why does BME do better? how is BME 
done? - because estimation is an operational task. Cer- 
tain readers may, however, ask "What's wrong with fre- 
quentism, anyway?" Others may be wondering what 
really distinguishes Bayesian and frequentist methods, 
since C{p) is crucial to both. I attempt to address these 
questions below. 
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1. Why frequentism fails 

The frequentist approach has dominated statistics for 
most of the 20th century, so its failure in quantum state 
estimation requires some explanation. To see why fre- 
quentism fails, we might first ask why it should succeed. 

MLE attempts to fit the observed data, and so the 
MLE estimate is the best "predictor" of the past. Since 
the goal of a state estimate is to predict the future, fre- 
quentist estimation techniques can be justified by the fol- 
lowing axiom: the future will look [statistically] identical 
to the past. If this axiom is true, then /5 M le is the best 
possible estimate. The Law of Large Numbers implies 
its validity as N — > oo, and the Central Limit Theorem 
quantifies this convergence. 

For classical systems, it is always possible that the Fre- 
quentist Axiom will hold. If the coin comes up "heads" 
the first time, it's entirely possible that it will always 
come up heads. Moreover, the rules are not going to 
change - the possible outcomes in the past were "heads" 
and "tails," and they will remain the only possible out- 
comes in the future. 

This doesn't hold for quantum systems. The past, rep- 
resented by the estimator's data, comprises a finite set of 
observations extracted from a finite variety of measure- 
ments. For instance, the estimator might have measured 
a x , a y , and a z on a qubit. Future experimenters, how- 
ever, might choose to measure any observable - and there 
arc infinitely many. A quantum state, by definition, pre- 
dicts the probabilities for every possible measurement. 
The frequentist axiom cannot possibly hold; any future 
observer could violate it at will, simply by making a novel 
measurement. 

Frequentist methods for classical probabilities yield 
zero probabilities only when 

(a) event "i" has never been observed, 

(b) in every trial, something in the complement of event 
"i" was observed. 

That is, event "i" could have happened, but it didn't. 
When MLE is used on quantum systems, the \<pi)(<pi\ that 
ends up getting assigned zero probability is almost never 
something that could have been observed. The Achilles' 
heel of frequentist quantum estimation is that it happily 
assigns zero probability to events that were never ob- 
served not to happen. To avoid this problem, we need a 
method that does not begin by assuming "the future will 
look like the past," because for a quantum system, that 
can't be true. 



2. How the Bayesian and frequentist approaches differ 

C(p) is the key ingredient in Bayesian methods, just 
as in frequentist ones. It represents everything relevant 
about the data. In frequentist methods, C(p) is the sole 
ingredient, and so the only natural thing to do is to find 



the p that maximizes it. Bayesian methods, in contrast, 
transform the likelihood into a probability distribution, 

L{p) — TT( P )dp (X £(p)TT a (p)dp, (42) 

by multiplying it by a prior distribution TTo(p)dp. 

A common misconception is that this transformation 
is trivial when TTo{p)dp is "flat" (e.g., coincides with a 
Lebesgue or Haar measure). On the contrary, it trans- 
forms a function into a distribution (or measure), which 
is an entirely different mathematical object. Functions, 
like C(p), have values. Distributions have integrals - they 
assign values not to points, but to regions. 

For example, if C{x) is defined for real-valued x, then 

£(0) and £(1) are well-defined, but J Q C(x) is purely 
meaningless. To integrate, we must multiply by dx (a 
measure), obtaining a distribution £(x)dc. This can be 
integrated over the interval [0, 1] - but evaluating C(x)dx 
at x — is ill-defined (and infinitesimal in any case). 

This difference between functions and distributions en- 
forces a difference in approach between frequentist and 
Bayesian methods. Frequentists, abjuring priors, can 
only work with the function C(p). The corresponding 
estimate, po, will be distinguished by the value of C(po). 
The Bayesian approach begins and ends with a distribu- 
tion, which has no values. Everything must be phrased 
in terms of measurable subsets (e.g., intervals), and inte- 
gration over them. 

Estimation algorithms transform data (observed in the 
past) into an estimate (which predicts the future). In or- 
der to select the best estimate, we must logically connect 
the past and the future. Frequentist methods implic- 
itly use the frequentist axiom, while Bayesian approaches 
take a weighted average over all possible theories. This 
averaging is particularly apropos for quantum state esti- 
mation, because density matrices have a natural convex 
structure. Suppose a physicist knows that a qubit's state 
is |o) with probability -| and |i) with probability |. He 
will describe it by the average state, p = |(|o)(o| +2 |i)(i|) 
- not the most likely state, p = |i)(i|. 

Viewed this way, the prior replaces the frequentist ax- 
iom as a connection between past and future. This can 
be an advantage, for a Bayesian is capable of gracefully 
acknowledging that the data are not descriptive of the 
true state - that they are unlikely, or simply insufficient. 
However, the price paid for this flexibility is the need to 
choose a prior, often without any good justification. 



III. WHERE DO WE GO FROM HERE? 

The Bayesian approach to state estimation has unde- 
niable advantages. It is accurate, it honestly represents 
the estimator's knowledge, and it conforms to quantum 
states' role as predictors. Purely frequentist approaches - 
e.g., maximum likelihood as it is currently used - cannot 
match these qualities. 
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Nonetheless, BME comes with an array of concomitant 
challenges. These range from the purely practical (inte- 
gration is hard) to the fundamental (how do we choose 
a prior?). While some are specific to Bayesian methods, 
others cast doubt on the scalability of any state estima- 
tion procedure. 



A. The Prior's Tale 

Of all the problems and caveats raised by BME, none 
is more pressing or obvious than "How do we choose 
a prior?" BME's optimality depends on the estima- 
tor's prior matching the "true" distribution of unknown 
states. This is fine in the rather artificial context of a 
state-estimating game that might be played many times, 
but physics experiments aren't drawn from an ensemble. 
Each experiment is, as a rule, unique. 

The prior is therefore a necessary fiction. As a conve- 
nient way of representing the estimator's ignorance (ei- 
ther genuine, or assumed for the sake of scientific im- 
partiality), it ought to be as uninformative as possible. 
Unitary invariance is a good first guideline (see Section 
IIII CI below, however). Over the spectrum of p, however, 
no uniquely suitable measure exists. Identifying particu- 
larly useful and non-informative priors remains an open 
and urgent question. 

A related open question is "What is the penalty for 
choosing the wrong prior?" If accuracy is measured by 
an operational divergence, then BME must outperform 
MLE and all other methods - if the estimator's prior 
matches the distribution of unknown states. Its robust- 
ness to a poor prior is unknown. The optimality proof 
given previously is elegant in its simplicity, but precisely 
because of that elegance, it provides few clues to this 
problem. 



B. Practical matters 

Every calculus student learns that integration is harder 
than differentiation. Numerical integration is an ac- 
tive and challenging field of numerical analysis, whereas 
differentiation involves little more than function eval- 
uation. BME consists almost entirely of integration, 
whereas MLE is a maximization problem. Unsurpris- 
ingly, the implementation of BME described above is 
roughly an order of magnitude slower than MLE. Ex- 
perimentalists, already frustrated by MLE analyses that 
run for a week or more [3lj , may be nonplussed. 

This state of affairs may owe a great deal to the fact 
that MLE algorithms, unlike BME, have been developed 
and used for 5-10 years. Substantial speedups are likely 
in the future - precisely because numerical integration 
remains something of an art. The Metropolis-Hastings 
algorithm already provides a tremendous advantage over 
naive Monte Carlo, so a few more orders of magnitude 
may be feasible. 



One reason for optimism is that the BME integral 
appears, in principle, to have a rather simple analytic 
form. The likelihood function is a polynomial, the prod- 
uct of many linear functions, of the form Tr(pMj). For 
certain priors (e.g., Hilbert-Schmidt) the resulting pos- 
terior looks a lot like a beta distribution of the form 
(3(x) — x n (l — x) N ~ n . This appears in classical estima- 
tion, and is easy to integrate. What makes the quantum 
case hard is the boundary conditions. Unlike the classi- 
cal probability simplex, the quantum state-set has curved 
edges that are awkward to integrate over. However, ana- 
lytic solutions can be obtained for small N, and a general 
solution might be possible. 

C. Scalability 

Quantum devices exist that provide coherent control 
over 8 to 12 qubits d, Twenty or thirty qubits will 
probably be controlled within the next five years (if only 
for a short time, and with limited fidelity). The Hilbcrt 
space of a 30-qubit quantum register is enormous - to 
merely store one density matrix for such a device would 
require just under 1 million terabytes of memory. State 
estimation, as we know it, is impossible in this context. 

Nonetheless, characterizing quantum hardware will re- 
main important. A quantum computer will not need 
state estimation; its results will appear as a computa- 
tional basis state, determined by a projective measure- 
ment. Development and testing of components, however, 
will depend crucially on state estimation. It is not suf- 
ficient to know whether or not the desired state is pro- 
duced; the designer will want to know the nature of the 
errors, so as to correct them. Eventually, these errors 
need to be reduced below a fault-tolerance threshold that 
is probably less than 10~ 3 . 

As the states that are estimated grow larger, and the 
uses to which they are put become more demanding, 
utterly new techniques will be needed. Unbiased esti- 
mation - i.e., guessing the system's state without any 
pre-existing assumptions - becomes exceedingly data- 
intensive for large Hilbert spaces. Making use of the esti- 
mator's prior knowledge will be essential. The Bayesian 
approach presented here provides a natural framework 
for doing so. However, a framework for reliably repre- 
senting that prior knowledge (without falling prey to self- 
fulfilling prophecies) will be necessary. This reason alone 
would justify further study of Bayesian state estimation. 
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APPENDIX A: NECESSARY AND SUFFICIENT 
CONDITION FOR A PRIOR'S ROBUSTNESS 

Theorem 1. A prior 7To(p)dp overdxd density operators 
is robust (and therefore guaranteed to generate full-rank 
estimates for any finite measurement record) if and only 
if its support in Hilbert- Schmidt space is not a subset of 
a finite intersection of ((<£ — l) 2 — 1)- dimensional hyper- 
planes that are tangent to the state-set. 

Proof: A prior is fragile if and only if it can be 
annihilated by a some finite-length measurement record: 
i.e., there exists M — {Mi,M 2 , ■ ■ ■ M N } so that C(p) — 
Yli—i Tr(Mip) is zero on the prior's support. C(p) is zero 
if and only if, for some i, Tr(Mip) = 0. Each Mj thus 
eliminates every p supported on M^s null space, which 
has at most (d — 1) dimensions. The Hermitian trace- 
1 matrices supported on Mj's null space form a (d — 
l) 2 — 1 dimensional hyperplane in Hilbert-Schmidt space. 
This hyperplane contains non-negative states, which are 
necessarily orthogonal to Mi, and therefore lie on the 
boundary of the state set. Thus, M, eliminates all density 
matrices lying within a hyperplane which includes states, 



but does not include full-rank states - i.e., a hyperplane 
that is tangent to the state-set. The states eliminated 
by M. are, therefore, merely the intersection of N such 
hyperplanes, and if the prior's support does not lie within 
such an intersection, it cannot be eliminated. 

Conversely, suppose that the prior's support does lie 
within an intersection of N such tangent hyperplanes. 
Each hyperplane is closed under convex combination, so 
we can define a convex combination of every non-negative 
element, po, which is itself an element of the hyperplane. 
Since the hyperplane is tangent to the state-set, no el- 
ement can lie in the interior, and so po is not full-rank 
- i.e., it is orthogonal to some |i/>)(?/>|. Since po is a con- 
vex combination of every state in the hyperplane, the 
entire hyperplane is orthogonal to |v , )(^|) and is therefore 
eliminated by observing A measurement record 

consisting of the annihilating projectors for each of the 
N hyperplanes will therefore annihilate the prior, so it is 
fragile. □ 

Corollary 2. Any prior with support on a smooth curve 
in at least (d — l) 2 dimensions is robust. 

Proof: Since the curve occupies at least (d — l) 2 
dimensions of Hilbert-Schmidt space, it cannot be con- 
tained in a ((d — l) 2 — l)-dimcnsional hyperplane. If it 
could be contained in a finite union of such planes, then 
it would not be smooth. □ 
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